-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Feature] support iceberg metadata table #33186
Conversation
Signed-off-by: stephen <[email protected]>
Conversions.fromByteBuffer(idToTypeMapping.get(entry.getKey()), entry.getValue())))); | ||
return GsonUtils.GSON.toJson(values); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here are some suggestions for the given code:
-
Error Handling: In the
getFiles
method, theCloseableIterator<FileScanTask> fileScanTaskIterator
is closed inside a try-catch with a catch block that ignores anyIOException
. Consider at least logging the error message or almost certainly handling it in some way since failing to close resources can lead to memory leaks. -
Conciseness and Readability: Some methods (e.g.,
getLowerBounds
andgetUpperBounds
) seem almost duplicated; you could consider refactoring them into one generic method to make the code more readable and reduce redundancy. -
Testing: From an overall perspective, this class and its methods seem complex. Depending on your testing strategy, many edge cases may need to be handled, so ensure appropriate unit tests are created wherever required.
-
Commenting: Although the naming conventions you used provide some context about what your code is supposed to do, adding comments for non-trivial sections of the code would improve readability.
-
Null Checking: Check for null values where relevant before invoking methods. An instance is where
fileScanTaskIterator.next()
is called. If the iterator does not have a next element, it could throw aNoSuchElementException
. You're usingfileScanTasks.hasNext()
which should prevent this but the next line usesfileScanTaskIterator.next()
.
Please note that these suggestions might depend on the bigger context and coding guidelines you follow in your project.
return fill_chunk(chunk); | ||
} | ||
|
||
} // namespace starrocks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code seems generally well-structured and follows good practices. I do have a few suggestions, however:
-
Error messages could be more descriptive: In functions like
get_next()
orstart()
, instead of just mentioning'Used before initialized'
or'input pointer is null'
you can provide the function name as well to make debugging easier. -
Avoid magic numbers: You might consider defining constants for each case in
fill_chunk(ChunkPtr* chunk)
. Magic numbers like 1, 2, 3...14 are not immediately clear on what they signify. -
Repeated Code: Inside your
fill_chunk
function's switch statement, the cases where you're dealing with string fields look repetitive. Consider creating a helper method to handle varchar types where you just pass the file attribute. -
Addressing possible empty strings: For varchar fields, it's a common practice to first check if the string is not empty before proceeding with other operations. Consider adding this logic to further bullet-proof your code.
-
Variable Naming, Explicitness: In the
start()
function,_files_index = 0;
would make better sense just after_files_res
has been loaded. Also, using meaningful naming (likecurrent_file_index
) could improve readability. -
Exception Handling: Consider adding exception handling or checking where
RETURN_IF_ERROR
macro is used. You may want to handle exceptions depending on the seriousness of the business impact.
Again, these are slight improvements that can be made over already good quality code.
return fill_chunk(chunk); | ||
} | ||
|
||
} // namespace starrocks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code you've presented seems to be well-structured and formatted, which is good. Here are a few suggestions that may enhance the quality and readablility of the code:
-
Avoid
using namespace
in headers: If any headers [implicitly] used have ausing namespace
statement, it's better to avoid or remove them, since it can lead to name conflicts. -
Put magic numbers into a constant: In the method
fill_chunk(ChunkPtr* chunk)
, the switch case statements use numerical values from 1 to 11. These could be replaced with descriptive enumeration or constants for better readability. -
Redundant code blocks: There are redundant brackets
{...}
in the cases of the switch statement in the methodfill_chunk()
. The extra scope created by these brackets is not necessary and can therefore be removed, enhancing overall readability. -
Error message context: When returning an error status such as
Status::InternalError("Used before initialized.")
, you might want to add context to help with debugging, for example, including the variable name or function where the error originated. -
Commenting: Although your methods likely have self-explanatory names, adding comments explaining what they do can improve readability.
-
Use safer C++ casts: Instead of using C-style casting
(void*)
, usestatic_cast<void*>
. They're safer because C++ style casts allow the compiler to check type information when the cast is performed. -
Check for null pointers: Where applicable, you could check for nullity before dereferencing pointers. This could prevent potential segmentation faults.
-
Safe division: Be aware when performing the division when initializing
SchemaScanner
. Ensure the denominator isn't zero. -
Input validation: Validate inputs where reasonable, throw or return meaningful exceptions or errors back to the user. For example, in the
start()
method before accessing_param->catalog
, validate if_param
is a valid pointer. -
Potential memory leak: If you've allocated memory during certain operations but didn't release it properly when you're done, a memory leak could occur. It's hard to tell with the given code, so make sure you handle that correctly if you've reserved memory on the heap somewhere outside of the provided code.
Note: Some recommendations might seem not critical because the context or the overall project structure is unknown. The code you provided only contains part of a class implementation; suggestions might change depending on the missing parts or the overall project conventions and standards.
Review the suggestions considering the specificity of your software project.
return fill_chunk(chunk); | ||
} | ||
|
||
} // namespace starrocks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, the code is quite clear and easy to understand. However, there are a few improvements and suggestions that can be made:
-
Magic Numbers: Instead of using magic numbers directly in the
fill_chunk
method (e.g.,case 1
,case 2
, etc.), you can use named constants. This can increase readability and maintainability. -
Error Messages: The error messages could be more descriptive. For instance,
"IP or port doesn't exist"
can specify which one doesn't exist. -
Null Check: In the
start
function, there is an extensive usage ofnullptr
checks before setting the request parameters. If_param
is really valid, consider refactoring the null check into a separate validation method, checking all fields at once. -
Comments: While the current comments provide some information (such as for
committed_at
), it would be helpful to have comments for other parts of the code as well, especially for complex/important calculations or logic. -
Code Duplication: There's some repeated code involving creating a Slice object and filling the column with its data. You might be able to make this a single function call with the necessary parameters passed in.
-
Default Function Statements: The declaration
SchemaIcebergSnapshotsScanner::~SchemaIcebergSnapshotsScanner() = default;
could be removed if it's not doing anything beyond the compiler-generated destructor. -
Namespace Usage: Consider using explicit namespace (
starrocks::TypeName
) instead of using clause (using namespace starrocks
). It better communicates where identifiers come from and also prevents symbol clashing. -
Consistent Naming Convention: Ensure that the naming convention is consistent across the entire project. For example, some function names are 'snake_case' while others are 'camelCase'.
The possibility of implementing these changes depends on the wider context of your code base and the standards of your specific project.
} | ||
return res.toString(); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, this code is well-written with clear responsibilities divided into methods. However, there are a few potential improvements you could make:
-
Avoid Magic Numbers: In the getPartitionSummaries method, consider avoiding the use of the magic number
i
in your loop and opt instead for a more descriptive variable name tailored to its function.for (int partitionIndex = 0; partitionIndex < summaries.size(); partitionIndex++) {..}
-
Use Optional efficiently:
- Instead of having an explicit condition check
if (!icebergTable.getSnapshot().isPresent()) { return res; }
, you could benefit from usingOptional.ifPresent
.
icebergTable.getSnapshot().ifPresent(snapshot -> { List<ManifestFile> manifestFiles = snapshot.allManifests(icebergTable.getNativeTable().io()); // Rest of logic here... });
- Instead of having an explicit condition check
-
Ensure Thread Safety: If multiple threads can access or modify
ConnectorTableId.CONNECTOR_ID_GENERATOR
, ensure that it’s thread-safe. Undefined behavior or data races could happen if one thread modifies the CONNECTOR_ID_GENERATOR while another uses it at the same time. -
Error Handling: How does your code handle exceptions? Currently there's no
try-catch
blocks or documentations to indicate what exceptions may be thrown and how they are handled. Pay particular attention to possible null values, missing fields, I/O operations, file not found errors, etc. -
Code Comments: While the code is generally clean, providing comments giving a brief rundown on what certain sections of the code do would increase maintainability and readability. This is especially important in the
getManifests()
andgetPartitionSummaries()
methods which contain some potentially complex logic. -
Immutability:
tIcebergManifests
ingetManifests()
can be made final to enhance code safety.final List<TIcebergManifest> tIcebergManifests = new ArrayList<>();
Do remember, these are suggestions and practices which might enhance the maintainability, readability, and quality of your code in general. The current code you provided is organized pretty well and does not seem to have any major issues at first glance.
SonarCloud Quality Gate failed. 1 Bug 0.0% Coverage Catch issues before they fail your Quality Gate with our IDE extension SonarLint |
[FE Incremental Coverage Report]😞 fail : 30 / 239 (12.55%) file detail
|
[BE Incremental Coverage Report]😞 fail : 0 / 241 (00.00%) file detail
|
@stephen-shelby |
Fixes #issue
support following sql:
select * from table_name$snapshots
select * from table_name$manifests
select * from table_name$files
select * from orders_test$snapshots order by committed_at desc limit 1 \G
*************************** 1. row ***************************
committed_at: 2023-10-20 01:42:23
snapshot_id: 4924636944729424676
parent_id: 7219695656029995314
operation: append
manifest_list: hdfs://e/orders_test-f70fff62928f4257860d7/metadata/snap-4924636944729424676-1.avro
summary: {"added-data-files":"2406","added-records":"1500000","added-files-size":"39169690","changed-partition-count":"2406","total-records":"1500000","total-files-size":"39169690","total-data-files":"2406","total-delete-files":"0","total-position-deletes":"0","total-equality-deletes":"0"}
select * from orders_test$manifests\G
*************************** 1. row ***************************
path: hdfs://ss/metadata/5d6e9209-07dc-435d-bb28-67e058c101a6-m0.avro
length: 252203
partition_spec_id: 0
added_snapshot_id: 4924636944729424676
added_data_files_count: 2406
added_rows_count: 1500000
existing_data_files_count: 0
existing_rows_count: 0
deleted_data_files_count: 0
deleted_rows_count: 0
partition_summaries: [{"contains_null":false,"contains_nan":false,"lower_bound":"1992-01-01","upper_bound":"1998-08-02"},{"contains_null":false,"contains_nan":false,"lower_bound":"0","upper_bound":"0"}]
select * from orders_test$files limit 1\G
*************************** 1. row ***************************
content: 0
file_path: hdfs://xx/data/o_orderdate=1992-12-15/o_shippriority=0/xxxxx.orc
file_format: ORC
spec_id: 0
record_count: 628
file_size_in_bytes: 16650
column_sizes:
value_counts: {"1":628,"2":628,"3":628,"4":628,"5":628,"6":628,"7":628,"8":628,"9":628}
null_value_counts: {"1":0,"2":0,"3":0,"4":0,"5":0,"6":0,"7":0,"8":0,"9":0}
nan_value_counts:
lower_bounds: {"1":"11552","2":"386","3":"F","4":"981","5":"1992-12-15","6":"1- URGENT","7":"Clerk#000000002","8":"0","9":" Tiresias lose b"}
upper_bounds: {"1":"5962919","2":"149743","3":"F","4":"460642","5":"1992-12-15","6":"5-LOW","7":"Clerk#000000999","8":"0","9":"yly. pending, ew"}
split_offsets:
equality_ids:
find max value of partition column
select max(get_json_string(partition_summaries, '$[0].upper_bound')) from orders_test$manifests;
+---------------------------------------------------------------+
| max(get_json_string(partition_summaries, '$[0].upper_bound')) |
+---------------------------------------------------------------+
| 1998-08-02 |
+---------------------------------------------------------------+
find min value of non-partition column
select min(get_json_string(lower_bounds, '$.2')) from orders_test$files;
+-------------------------------------------+
| min(get_json_string(lower_bounds, '$.2')) |
+-------------------------------------------+
| 1 |
+-------------------------------------------+
What type of PR is this:
Does this PR entail a change in behavior?
If yes, please specify the type of change:
Checklist:
Bugfix cherry-pick branch check: